{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part of Speech Tagging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Part of speech tagging task aims to assign every word/token in plain text a category that identifies the syntactic functionality of the word occurrence.\n",
    "\n",
    "Polyglot recognizes 17 parts of speech, this set is called the `universal part of speech tag set`:\n",
    "\n",
    "- **ADJ**: adjective\n",
    "- **ADP**: adposition\n",
    "- **ADV**: adverb\n",
    "- **AUX**: auxiliary verb\n",
    "- **CONJ**: coordinating conjunction\n",
    "- **DET**: determiner\n",
    "- **INTJ**: interjection\n",
    "- **NOUN**: noun\n",
    "- **NUM**: numeral\n",
    "- **PART**: particle\n",
    "- **PRON**: pronoun\n",
    "- **PROPN**: proper noun\n",
    "- **PUNCT**: punctuation\n",
    "- **SCONJ**: subordinating conjunction\n",
    "- **SYM**: symbol\n",
    "- **VERB**: verb\n",
    "- **X**: other"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Languages Coverage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The models were trained on a combination of:\n",
    "\n",
    "- Original CONLL datasets after the tags were converted using the [universal POS tables](http://universaldependencies.github.io/docs/tagset-conversion/index.html).\n",
    "\n",
    "- Universal Dependencies 1.0 corpora whenever they are available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  1. German                     2. Italian                    3. Danish                   \n",
      "  4. Czech                      5. Slovene                    6. French                   \n",
      "  7. English                    8. Swedish                    9. Bulgarian                \n",
      " 10. Spanish; Castilian        11. Indonesian                12. Portuguese               \n",
      " 13. Finnish                   14. Irish                     15. Hungarian                \n",
      " 16. Dutch                    \n"
     ]
    }
   ],
   "source": [
    "from polyglot.downloader import downloader\n",
    "print(downloader.supported_languages_table(\"pos2\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download Necessary Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[polyglot_data] Downloading package embeddings2.en to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package embeddings2.en is already up-to-date!\n",
      "[polyglot_data] Downloading package pos2.en to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package pos2.en is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "polyglot download embeddings2.en pos2.en"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We tag each word in the text with one part of speech."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.text import Text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "blob = \"\"\"We will meet at eight o'clock on Thursday morning.\"\"\"\n",
    "text = Text(blob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can query all the tagged words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'We', u'PRON'),\n",
       " (u'will', u'AUX'),\n",
       " (u'meet', u'VERB'),\n",
       " (u'at', u'ADP'),\n",
       " (u'eight', u'NUM'),\n",
       " (u\"o'clock\", u'NOUN'),\n",
       " (u'on', u'ADP'),\n",
       " (u'Thursday', u'PROPN'),\n",
       " (u'morning', u'NOUN'),\n",
       " (u'.', u'PUNCT')]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.pos_tags"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After calling the pos_tags property once, the words objects will carry the POS tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'PRON'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.words[0].pos_tag"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Command Line Interface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "which           DET  \r\n",
      "India           PROPN\r\n",
      "beat            VERB \r\n",
      "Bermuda         PROPN\r\n",
      "in              ADP  \r\n",
      "Port            PROPN\r\n",
      "of              ADP  \r\n",
      "Spain           PROPN\r\n",
      "in              ADP  \r\n",
      "2007            NUM  \r\n",
      ",               PUNCT\r\n",
      "which           DET  \r\n",
      "was             AUX  \r\n",
      "equalled        VERB \r\n",
      "five            NUM  \r\n",
      "days            NOUN \r\n",
      "ago             ADV  \r\n",
      "by              ADP  \r\n",
      "South           PROPN\r\n",
      "Africa          PROPN\r\n",
      "in              ADP  \r\n",
      "their           PRON \r\n",
      "victory         NOUN \r\n",
      "over            ADP  \r\n",
      "West            PROPN\r\n",
      "Indies          PROPN\r\n",
      "in              ADP  \r\n",
      "Sydney          PROPN\r\n",
      ".               PUNCT\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en pos | tail -n 30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Citation\n",
    "\n",
    "This work is a direct implementation of the research being described in the [Polyglot: Distributed Word Representations for Multilingual NLP](http://www.aclweb.org/anthology/W13-3520) paper.\n",
    "The author of this library strongly encourage you to cite the following paper if you are using this software."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "```\n",
    "   @InProceedings{polyglot:2013:ACL-CoNLL,\n",
    "     author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},\n",
    "     title     = {Polyglot: Distributed Word Representations for Multilingual NLP},\n",
    "     booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},\n",
    "     month     = {August},\n",
    "     year      = {2013},\n",
    "     address   = {Sofia, Bulgaria},\n",
    "     publisher = {Association for Computational Linguistics},\n",
    "     pages     = {183--192}, \n",
    "     url       = {http://www.aclweb.org/anthology/W13-3520}\n",
    "   }\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "- [Universal Part of Speech Tagging](http://universaldependencies.github.io/docs/u/pos/index.html)\n",
    "- [Universal Dependencies 1.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1464)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}